Add support for structured parameters #47

klueska · 2024-04-29T12:33:16Z

This PR adds support for structured parameters to the example DRA driver for Kubernetes v1.30 (removing the support for "classic" DRA in the process). A new branch called classic-dra has been created to preserve support for it.

Closes #44

klueska · 2024-04-29T12:34:33Z

/cc @pohly @bart0sh @aojea @asm582

k8s-ci-robot · 2024-04-29T12:34:37Z

@klueska: GitHub didn't allow me to request PR reviews from the following users: asm582.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @pohly @bart0sh @aojea @asm582

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

api/example.com/resource/gpu/v1alpha1/api.go

pohly · 2024-04-29T14:42:47Z

README.md

@@ -117,11 +136,48 @@ Spec:
 ...
 ```

+**Note**: When running with `StructuredParameters` you can also check that the


Does it make sense to publish NodeAllocationState when running with structured parameters?

That's not going to get updated, is it?

Hmm, it did get updated. But how? Is that something that the kubelet plugin does?

I can imagine that it was easy to do this because the code was already there, but as an example of how to write a DRA driver I find it confusing.

I tried to make the minimal changes necessary to support both classic DRA and structured parameters in the same code base. It would definitely look different if this weren't a goal. In fact, I would probably move all the state of already-prepared claims out of the CRD and instead checkpoint in a file on the node. Is it worth making this separation in this PR into two separate plugins / controllers that get deployed depending on the withStructuredParameters flag?

It would make the code more realistic for folks who only want to do structured parameters. I think it's worthwhile.

This has now been updated to strip out support for classic DRA

pohly · 2024-04-29T14:48:43Z

cmd/dra-example-controller/claimparametersgen.go

+
+	klog.Infof("Created ResourceClaimParameters for GpuClaimParameters %s/%s", namespace, gpuClaimParameters.Name)
+	return nil
+}


This is a non-trivial amount of code. Would it be worth adding unit tests or do we leave that as an exercise to the reader?

The demo works.

It's always worth it to add unit tests :)

I'm going to skip unit tests for now since this goes away in 1.31 anyway.

pohly · 2024-04-29T14:49:55Z

README.md

 Next, deploy four example apps that demonstrate how `ResourceClaim`s,
 `ResourceClaimTemplate`s, and custom `ClaimParameter` objects can be used to
 request access to resources in various ways:
 ```bash
-kubectl apply --filename=gpu-test{1,2,3,4}.yaml
+kubectl apply --filename=demo/gpu-test{1,2,3,4}.yaml


Showing that parameters got converted when using structured parameters would be useful here. OTOH, it's not very impressive right now. So perhaps not (yet)?

Yes, it's not very interesting yet ... I will definitely add this once I open a PR to introduce selection via a vendor-specific CRD.

I will wait to do this as part of the 1.31 update


bart0sh · 2024-07-15T08:45:24Z

cmd/dra-example-kubeletplugin/driver.go

 }

 func (d *driver) NodePrepareResources(ctx context.Context, req *drapbv1.NodePrepareResourcesRequest) (*drapbv1.NodePrepareResourcesResponse, error) {
-	logger := klog.FromContext(ctx)
-	logger.Info("NodePrepareResource", "numClaims", len(req.Claims))
+	klog.Infof("NodePrepareResource is called: number of claims: %d", len(req.Claims))


Just wondering: why do you prefer using klog directly over the context logger?
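For readers unfamiliar with the distinction, here is a minimal sketch of what a context-carried logger buys you. To stay dependency-free it uses the standard library's `log/slog` rather than klog; `withLogger` and `fromContext` are hypothetical helpers that play the role of `klog.NewContext` and `klog.FromContext`, and the `driver=gpu` attribute is made up for illustration.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
)

type loggerKey struct{}

// withLogger and fromContext are hypothetical stand-ins for
// klog.NewContext and klog.FromContext.
func withLogger(ctx context.Context, l *slog.Logger) context.Context {
	return context.WithValue(ctx, loggerKey{}, l)
}

func fromContext(ctx context.Context) *slog.Logger {
	if l, ok := ctx.Value(loggerKey{}).(*slog.Logger); ok {
		return l
	}
	return slog.Default() // fall back to the global logger
}

// nodePrepareResources logs through whatever logger the caller attached,
// so caller-supplied key/value pairs show up without extra plumbing.
func nodePrepareResources(ctx context.Context, numClaims int) {
	fromContext(ctx).Info("NodePrepareResources", "numClaims", numClaims)
}

// demoLogging returns what a caller-configured logger captures.
func demoLogging() string {
	var buf bytes.Buffer
	opts := &slog.HandlerOptions{
		// Drop the timestamp so the output is stable.
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if a.Key == slog.TimeKey && len(groups) == 0 {
				return slog.Attr{}
			}
			return a
		},
	}
	logger := slog.New(slog.NewTextHandler(&buf, opts)).With("driver", "gpu")
	nodePrepareResources(withLogger(context.Background(), logger), 2)
	return buf.String()
}

func main() {
	fmt.Print(demoLogging())
}
```

A direct call to a global logger would lose the `driver=gpu` pair, and a test could not capture the output without redirecting global state.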

bart0sh · 2024-07-15T08:48:10Z

cmd/dra-example-kubeletplugin/driver.go

-		return nil
-	})
-
+	prepared, err := d.state.Prepare(claim.Uid, allocated)


It looks like the claim preparation logic is implemented in state. I expected most of it here, in driver.go.

What does that leave for state to do then? Are you suggesting I get rid of the state indirection altogether?

It would leave at least checkpointing and probably other state-related operations. The actual device preparation should be decoupled from the state management in my opinion.

I'm not sure how I feel about pulling the list of allocatable devices up into the driver -- this is the main thing that the state abstraction maintains, and "preparation" at this level is basically checking if the allocated devices are actually in the allocatable list, checkpointing them, and then returning the set of CDI devices associated with them. Maybe you just don't like the name prepare() for this lower-level operation? Do you have another suggestion for the name?

I had the impression that all resource preparation code is in state.prepare. If this is not the case, then I'm OK with this.
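The lower-level operation described in this thread (validate against the allocatable list, checkpoint, return CDI devices) could be sketched roughly as follows. This is not the driver's actual code: `checkpointer`, `memCheckpoint`, and `prepare` are hypothetical names, and the `example.com/gpu=` prefix is an assumed CDI vendor/class pair chosen for illustration.

```go
package main

import "fmt"

// checkpointer is a hypothetical interface covering only the
// state-keeping concern.
type checkpointer interface {
	Save(claimUID string, devices []string) error
}

// memCheckpoint is an in-memory checkpointer used only in this sketch.
type memCheckpoint map[string][]string

func (m memCheckpoint) Save(claimUID string, devices []string) error {
	m[claimUID] = devices
	return nil
}

// prepare follows the three steps from the discussion: check that every
// allocated device is allocatable, checkpoint the claim, and return the
// CDI device names.
func prepare(allocatable map[string]bool, cp checkpointer, claimUID string, allocated []string) ([]string, error) {
	for _, dev := range allocated {
		if !allocatable[dev] {
			return nil, fmt.Errorf("device %q is not allocatable", dev)
		}
	}
	if err := cp.Save(claimUID, allocated); err != nil {
		return nil, err
	}
	cdi := make([]string, 0, len(allocated))
	for _, dev := range allocated {
		cdi = append(cdi, "example.com/gpu="+dev) // assumed CDI name format
	}
	return cdi, nil
}

func main() {
	cp := memCheckpoint{}
	allocatable := map[string]bool{"gpu-0": true, "gpu-1": true}
	fmt.Println(prepare(allocatable, cp, "uid-1", []string{"gpu-0"}))
	fmt.Println(prepare(allocatable, cp, "uid-2", []string{"gpu-9"}))
}
```

Any real device setup would slot in between the validation and the checkpoint step, which is where the driver-versus-state question above becomes relevant.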

bart0sh · 2024-07-15T09:02:08Z

Looks good to me except for a couple of nits I've mentioned above.

One thing I've noticed is that device preparation is handled by the driver state. I'd suggest moving it out of there and into the driver. I understand that for the demo driver it can be empty, but in my opinion it shouldn't be in the state implementation.


bart0sh · 2024-07-17T05:49:10Z

README.md

+  metadata:
+    creationTimestamp: "2024-04-17T13:45:44Z"
+    generateName: dra-example-driver-cluster-worker-gpu.resource.example.com-
+    name: dra-example-driver-cluster-worker-gpu.resource.example.comxktph


Would it make sense to separate the random suffix from the name, e.g.
dra-example-driver-cluster-worker-gpu.resource.example.comxktph -> dra-example-driver-cluster-worker-gpu.resource.example.com.xktph ?

That is a question for @pohly -- this comes from his library for generating the name

The GenerateName: dra-example-driver-cluster-worker-gpu.resource.example.com- already has a trailing -. But as that prefix is long, Kubernetes truncates it and drops the trailing -.

One way to avoid that would be to truncate the <node>-<driver>- string in advance if it is too long, for example by cutting out some characters in the middle of each sub-component. Would that be better?

> this comes from his library for generating the name

Just to be clear: that library produces generateName, not the name - that is the name generated by Kubernetes.
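The truncation described above can be reproduced with a few lines. This is a sketch that assumes the usual 63-character name limit and 5-character random suffix used by the API server's name generator; `generatedName` is a hypothetical helper, and the suffix is passed in explicitly because the real one is random.

```go
package main

import "fmt"

const (
	maxNameLength          = 63 // assumed overall limit for generated names
	randomLength           = 5  // assumed length of the random suffix
	maxGeneratedNameLength = maxNameLength - randomLength
)

// generatedName mimics how a name is built from generateName: the base is
// cut to maxGeneratedNameLength before the suffix is appended.
func generatedName(base, suffix string) string {
	if len(base) > maxGeneratedNameLength {
		base = base[:maxGeneratedNameLength]
	}
	return base + suffix
}

func main() {
	base := "dra-example-driver-cluster-worker-gpu.resource.example.com-"
	// The base is one character too long, so only the trailing "-" is lost.
	fmt.Println(generatedName(base, "xktph"))
	// prints dra-example-driver-cluster-worker-gpu.resource.example.comxktph
}
```

Shortening the `<node>-<driver>-` prefix in advance, as suggested above, would keep the base within the limit so the trailing `-` survives.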

byako

/lgtm

byako · 2024-07-23T14:21:21Z

cmd/dra-example-controller/claimparametersgen.go

+		AddFunc: func(obj any) {
+			unstructured, ok := obj.(*unstructured.Unstructured)
+			if !ok {
+				klog.Errorf("Error converting object to *unstructured.Unstructured: %v", obj)


Suggested change

-				klog.Errorf("Error converting object to *unstructured.Unstructured: %v", obj)
+				klog.Errorf("Error converting object to *unstructured.Unstructured: %v", obj)
+				return

Otherwise unstructured == nil is used on line 109
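A minimal, dependency-free illustration of why the early return matters: when a comma-ok type assertion to a pointer type fails, the result is a nil pointer, and any later field access panics. The `item` type and `nameOf` function are hypothetical stand-ins for `*unstructured.Unstructured` and the handler body.

```go
package main

import "fmt"

// item stands in for *unstructured.Unstructured in this sketch.
type item struct{ name string }

// nameOf mirrors the shape of the AddFunc handler: it logs and returns
// early when the type assertion fails.
func nameOf(obj any) (string, bool) {
	it, ok := obj.(*item)
	if !ok {
		fmt.Printf("error converting object to *item: %v\n", obj)
		return "", false // without this return, it.name below would panic on a nil pointer
	}
	return it.name, true
}

func main() {
	fmt.Println(nameOf(&item{name: "gpu-params"}))
	fmt.Println(nameOf("not an item"))
}
```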

byako · 2024-07-23T14:21:57Z

cmd/dra-example-controller/claimparametersgen.go

+		UpdateFunc: func(oldObj any, newObj any) {
+			unstructured, ok := newObj.(*unstructured.Unstructured)
+			if !ok {
+				klog.Errorf("Error converting object to *unstructured.Unstructured: %v", newObj)


Suggested change

-				klog.Errorf("Error converting object to *unstructured.Unstructured: %v", newObj)
+				klog.Errorf("Error converting object to *unstructured.Unstructured: %v", newObj)
+				return

byako · 2024-07-23T14:45:44Z

cmd/dra-example-kubeletplugin/state.go

 	return cdiDevices, nil
 }

 func (s *DeviceState) Unprepare(claimUID string) error {
 	s.Lock()
 	defer s.Unlock()

-	if s.prepared[claimUID] == nil {
+	checkpoint := newCheckpoint()


Why does the checkpoint need to be loaded every time Unprepare or Prepare is called? Is it some sort of protection against the kubelet-plugin Pod being offline or temporarily suspended? If not, my impression was that the checkpoint could be loaded just once, when the kubelet-plugin object and its new state object are created. Once the state is ready, the prepared claims would need to be saved to the checkpoint every time Prepare or Unprepare is called, trusting that the State object is the source of truth and ignoring old data in snapshots.
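The load-once alternative described here could look roughly like the sketch below. It is not the plugin's code: `deviceState`, `newDeviceState`, and the JSON file layout are assumptions made for illustration, and a production version would also want an atomic write (temp file plus rename).

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"sync"
)

// deviceState is a hypothetical stand-in for the plugin's DeviceState.
type deviceState struct {
	mu       sync.Mutex
	path     string              // checkpoint file location
	prepared map[string][]string // claim UID -> prepared device names
}

// newDeviceState reads the checkpoint exactly once, at construction time.
func newDeviceState(path string) (*deviceState, error) {
	s := &deviceState{path: path, prepared: map[string][]string{}}
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return s, nil // first start: nothing prepared yet
	}
	if err != nil {
		return nil, err
	}
	if err := json.Unmarshal(data, &s.prepared); err != nil {
		return nil, err
	}
	return s, nil
}

// save persists the in-memory map, which is treated as the source of truth.
func (s *deviceState) save() error {
	data, err := json.Marshal(s.prepared)
	if err != nil {
		return err
	}
	return os.WriteFile(s.path, data, 0o600)
}

func (s *deviceState) Prepare(claimUID string, devices []string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.prepared[claimUID] = devices
	return s.save()
}

func (s *deviceState) Unprepare(claimUID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.prepared, claimUID)
	return s.save()
}

// demoRestart prepares two claims, unprepares one, then simulates a plugin
// restart and reports how many claims the new state object restored.
func demoRestart() (int, error) {
	dir, err := os.MkdirTemp("", "dra-checkpoint")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "checkpoint.json")

	s, err := newDeviceState(path)
	if err != nil {
		return 0, err
	}
	if err := s.Prepare("claim-a", []string{"gpu-0"}); err != nil {
		return 0, err
	}
	if err := s.Prepare("claim-b", []string{"gpu-1"}); err != nil {
		return 0, err
	}
	if err := s.Unprepare("claim-a"); err != nil {
		return 0, err
	}
	restarted, err := newDeviceState(path)
	if err != nil {
		return 0, err
	}
	return len(restarted.prepared), nil
}

func main() {
	n, err := demoRestart()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("claims restored after restart:", n)
}
```

The trade-off is that this trusts the process's memory between calls; reloading on every call instead guards against the file being changed behind the plugin's back.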

k8s-ci-robot · 2024-07-23T14:51:12Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: byako, klueska

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [byako,klueska]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from byako and pohly April 29, 2024 12:33

k8s-ci-robot added the cncf-cla: yes, approved, and size/XL labels Apr 29, 2024

k8s-ci-robot requested a review from bart0sh April 29, 2024 12:34

k8s-ci-robot requested a review from aojea April 29, 2024 12:34

klueska force-pushed the structured-parameters branch 3 times, most recently from ab745f7 to f848603 Compare April 29, 2024 12:40

pohly reviewed Apr 29, 2024

api/example.com/resource/gpu/v1alpha1/api.go

pohly reviewed Apr 29, 2024

klueska added 5 commits July 10, 2024 15:39

Update go dependencies to kubernetes v1.30.2

a064d04

Signed-off-by: Kevin Klues <[email protected]>

Update to kindest/node:v1.30.2

6b1df1b

Signed-off-by: Kevin Klues <[email protected]>

Add support to advertise allocatable GPUs as a ResourceSlice

f35a0cd

Signed-off-by: Kevin Klues <[email protected]>

Add support for {Prepare,Unprepare}Resources with stuctured parameters

c82ba30

Signed-off-by: Kevin Klues <[email protected]>

Turn the GpuClaimParameters.Count field into a pointer type

41ca66b

Signed-off-by: Kevin Klues <[email protected]>

klueska force-pushed the structured-parameters branch from f848603 to 1d1e362 Compare July 10, 2024 23:45

k8s-ci-robot added the size/XXL label and removed the size/XL label Jul 10, 2024

klueska force-pushed the structured-parameters branch 3 times, most recently from 19178d5 to b6b1206 Compare July 11, 2024 09:08

bart0sh reviewed Jul 15, 2024

README.md

bart0sh reviewed Jul 15, 2024

README.md

bart0sh reviewed Jul 15, 2024

README.md

klueska force-pushed the structured-parameters branch from b6b1206 to 1e3c484 Compare July 15, 2024 08:24

bart0sh reviewed Jul 15, 2024

klueska added 5 commits July 16, 2024 11:27

Add ability to generate ResourceClaimParameters from GpuClaimParameters

f46eea8

Signed-off-by: Kevin Klues <[email protected]>

Force ResourceClass deployed with helm to use structured parameters

0a0a8bd

Signed-off-by: Kevin Klues <[email protected]>

Remove support for "classic" DRA

62a4696

Signed-off-by: Kevin Klues <[email protected]>

Add support for checkpointing prepared devices by kubelet plugin

8c7f2a7

Signed-off-by: Kevin Klues <[email protected]>

Update README with changes for structured parameters

49c8d0a

Signed-off-by: Kevin Klues <[email protected]>

klueska force-pushed the structured-parameters branch from 1e3c484 to 49c8d0a Compare July 16, 2024 11:28

bart0sh reviewed Jul 17, 2024

byako approved these changes Jul 23, 2024

k8s-ci-robot assigned byako Jul 23, 2024

k8s-ci-robot added the lgtm label Jul 23, 2024

k8s-ci-robot merged commit 6e99295 into main Jul 23, 2024
7 of 8 checks passed
